fix: resolve captureTurn when app-server disconnects mid-turn#243
fix: resolve captureTurn when app-server disconnects mid-turn#243ruan330 wants to merge 1 commit intoopenai:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9b4d141c8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| appClient.exitPromise.then(() => { | ||
| shutdown(server).catch(() => {}).finally(() => process.exit(1)); |
There was a problem hiding this comment.
Ignore intentional broker shutdowns in exit watcher
appClient.exitPromise resolves on both upstream crashes and normal appClient.close() calls. In this script, all intentional termination paths (broker/shutdown, SIGTERM, and SIGINT) call shutdown(server), which closes appClient, so this watcher also triggers and runs shutdown(server) again before forcing process.exit(1). That can turn graceful shutdowns into crash exits and cause shutdown reentrancy/races during normal broker teardown.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 14123d3. Added a shuttingDown flag set by shutdown() and checked in the watchdog, so intentional teardown paths no longer race. Regression test: broker exits cleanly on broker/shutdown without tripping the upstream-exit watchdog.
Root cause
----------
`captureTurn` (plugins/codex/scripts/lib/codex.mjs) blocks on
`state.completion`. Two existing resolution paths both require server
cooperation:
A. `turn/completed` notification from Codex app-server
B. `finalAnswerSeen + scheduleInferredCompletion` (fires 250ms after
an `agentMessage` with `phase === "final_answer"`)
When the Codex CLI exits mid-turn — rate limit, OOM, internal error,
token limit — neither fires: the only `agentMessage` had `phase ==
"plan"` so `finalAnswerSeen` stayed `false`, and no `turn/completed`
was ever sent. `state.completion` hangs indefinitely. The node
companion process itself exits cleanly via IPC EOF, bypassing
`runTrackedJob`'s try/catch, leaving a zombie job record
(`status: "running"`) and no final verdict. Observed three times in
the wild on 2026-04-18 while running `/codex:adversarial-review` from
Claude Code.
In broker mode (`app-server-broker.mjs`) the bug was worse: the broker
did not watch the upstream `appClient`, so when the Codex CLI died the
broker kept accepting socket connections. Clients got neither
`turn/completed` nor a socket EOF.
Fix (three coupled changes, one semantic unit)
-----------------------------------------------
1. `captureTurn` subscribes to `client.exitPromise` as a third
resolution path. If a `final_answer` agentMessage has already been
captured AND no subagent work is outstanding, honor the existing
inferred-success semantics — call `completeTurn(state, null, {
inferred: true })` without the 250ms debounce, which is meaningless
on a dead socket. Only when no terminal signal was captured do we
synthesize a `failed` turn so `buildResultStatus` returns non-zero.
2. `app-server-broker.mjs` subscribes to the upstream
`appClient.exitPromise` and tears down the server when the Codex
CLI dies. Connected clients receive socket EOF and the watchdog in
openai#1 resolves their `captureTurn`. Broker exits with status 1 so
supervisors can restart it.
3. The broker watchdog skips when `shuttingDown` is true. Without this
guard, intentional teardown paths (`broker/shutdown`, `SIGTERM`,
`SIGINT`) all call `shutdown(server)` → `appClient.close()` which
resolves the same `exitPromise`, re-entering `shutdown()` and
racing `process.exit(0)` with `process.exit(1)`.
Why the obvious small fixes were rejected
------------------------------------------
- Relax `finalAnswerSeen` to any `lifecycle === "completed"`
agentMessage: changes semantics for well-behaved turns; can commit
prematurely on intermediate plan-phase messages.
- Only a max-turn timeout: loses `lastAgentMessage` diagnostic, fires
on legitimately long turns, and doesn't catch fast disconnects.
- Broker-side timeout alone: doesn't help direct-transport callers.
Tests
-----
Three regression tests in `tests/runtime.test.mjs` against new fake
behaviors in `tests/fake-codex-fixture.mjs`:
* `adversarial review resolves when app-server disconnects before
turn/completed` — plan-phase agentMessage then disconnect; asserts
non-zero exit + rendered output mentions the disconnect.
* `adversarial review preserves final_answer when app-server exits
before turn/completed` — final_answer agentMessage then
disconnect; asserts exit 0 + structured findings survive.
* `broker exits cleanly on broker/shutdown without tripping the
upstream-exit watchdog` — sends `broker/shutdown` RPC to a
non-detached broker, asserts exit 0.
Baseline suite: 88 tests, 82 pass, 6 fail pre-existing (status/result
caching in test state — unrelated). With this patch: 90 tests, 85 pass,
6 fail. +3 new passing tests, no regression.
Upstream pointer
----------------
Why the Codex CLI can exit mid-turn without flushing `turn/completed`
is orthogonal to this PR and belongs in `openai/codex`. The watchdog
here makes the companion resilient regardless.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b9b4d14 to
14123d3
Compare
Summary
captureTurnhangs indefinitely when the Codex app-server disconnects before sendingturn/completedor afinal_answer-phase agentMessage. The companion process exits via IPC EOF bypassingrunTrackedJob's try/catch, leaving a zombie job record (status: "running") and no final verdict. In broker mode the hang is worse: the broker stays alive with no upstream, so clients get neitherturn/completednor a socket EOF.This PR adds a watchdog on both transports, preserves the existing inferred-success semantics when Codex disconnects after a completed
final_answer, and covers the lifecycle with three regression tests.Observed in the wild
/codex:adversarial-reviewsessions on a user workstation hung three times in a row on 2026-04-18. Logs showed Codex CLI emitted exactly oneagentMessage(phaseplan), ran ~15 shell commands, then exited cleanly withoutturn/completed. Claude Code reportedexit code 0, but no final verdict was printed,/codex:statusreported the job asrunningindefinitely, and/codex:resultreturnedNo job found.Root cause is not adversarial-review specific — any
runAppServerTurncaller shaped like this hangs./codex:reviewusesrunAppServerReviewand is unaffected.Root cause
captureTurnblocks onstate.completionwhich only resolves via:A.
turn/completednotification with matchingthreadId, orB.
finalAnswerSeen + scheduleInferredCompletion(fires 250ms after anagentMessagewithphase === "final_answer")Both paths assume the Codex CLI cooperates and emits a terminal signal before disconnecting. When the CLI exits mid-turn (rate limit / OOM / internal error / token limit), neither triggers —
state.completionhangs forever.In broker mode,
app-server-broker.mjsforwards notifications fromappClientto socket clients but does not react toappClient.exitPromise. A dead upstream leaves the broker idle-alive; clients see neither a completion event nor a socket close.Fix (three coupled parts, one semantic unit)
1.
captureTurnwatchdog with success preservation (plugins/codex/scripts/lib/codex.mjs)Subscribe to
client.exitPromiseas a third resolution path. If afinal_answeragentMessage was already captured AND no subagent work is outstanding, honor the existing inferred-success semantics — callcompleteTurn(state, null, { inferred: true })without waiting the 250ms debounce (meaningless on a dead socket). Only when no terminal signal was captured do we synthesize afailedturn sobuildResultStatusreturns non-zero.This alignment is important: an earlier iteration of this PR unconditionally marked disconnect as
failed, which regressed the valid case of Codex sending a complete answer and then closing cleanly. The current shape matchesscheduleInferredCompletion's authority rule (finalAnswerSeen+ no pending work → success).2. Broker propagates upstream exit (
plugins/codex/scripts/app-server-broker.mjs)Subscribe to the upstream
appClient.exitPromise. When the Codex CLI dies, tear down the server (close sockets, unlink endpoint + pid files) andprocess.exit(1). Connected clients receive socket EOF; theircaptureTurnresolves via #1. Exit status 1 lets a supervisor restart the broker.3. Broker watchdog skips intentional shutdown
The watchdog in #2 gates on a
shuttingDownflag set byshutdown(). Without this guard, intentional teardown paths (broker/shutdown,SIGTERM,SIGINT) all callshutdown(server)→appClient.close()which resolves the sameexitPromisethe watchdog listens on — re-enteringshutdown()and racing the intentionalprocess.exit(0)withprocess.exit(1). This race turned every graceful shutdown into a spurious crash exit.Why the obvious small fixes were rejected
finalAnswerSeento anylifecycle === "completed"agentMessage: changes semantics for well-behaved turns, can commit prematurely on intermediate plan-phase messages.lastAgentMessagediagnostic, fires on legitimately long turns, and doesn't catch fast disconnects.Tests
Three regression tests in
tests/runtime.test.mjsdriven by new fake behaviors intests/fake-codex-fixture.mjs:disconnect-before-completionadversarial-reviewexits non-zero + output mentions disconnectfinal-answer-then-exitturn/completedbroker exits cleanly on broker/shutdownbroker/shutdownRPCAll three new tests pass. Existing-test pass rate on the same machine is identical with vs without this patch (i.e. the patch introduces zero new failures) — CI on this PR will be authoritative for the absolute pass count.
Review cycle
Two rounds of Codex adversarial review on the fix itself caught two high-severity races that the first iteration missed:
shutdown()resolves the sameexitPromisethe watchdog listens on, re-entering and forcingexit(1)duringSIGTERM/broker/shutdown. Flagged by GitHub Codex inline comment +/codex:adversarial-reviewfirst pass; addressed by Review fails due to unknown sandboxing variant #3.exitPromisehandler unconditionally markedfailed, overridingscheduleInferredCompletion's validfinal_answersuccess path when disconnect raced the 250ms debounce. Flagged by/codex:adversarial-reviewsecond pass; addressed by Add Gemini CLI extension commands #1.All findings addressed. The final squashed commit (
14123d3) receivedapproveverdict from a third adversarial review pass.Upstream pointer
Why the Codex CLI can exit mid-turn without flushing
turn/completedis orthogonal to this PR and belongs inopenai/codex. The watchdog here makes the companion resilient regardless.Test plan